
[Figure: training-step schematic; surviving labels include "Preprocessing", "f(x, v)", and "feature dimension"]

Neural Information Processing Systems

In the following sections, we provide additional details about the network architecture, training, and experiments. The source code and WBC-SPH data set are published at https://github.com/ [...] A.1 Implementation Details. We implement our neural network with TensorFlow (https://www.tensorflow.org) [...] They also serve as the basis for the implementation of our antisymmetric CConv (ASCC) layer. Axis for Mirroring. As mentioned in the main text, the mirror axis for ASCC layers can be chosen freely as long as it fulfills the requirements from theory. This provides a degree of freedom for the implementation. We decided to use a fixed axis, which in our case corresponds to the spatial y-axis. While the mirroring could potentially be coupled to the spatial content of the features, we found that a single, fixed mirror axis simplifies the implementation of the ASCCs and is hence preferable in practice. Additional Modifications. In addition to the properties of our algorithm discussed in Section 2.3 and the ablation study in Section 3, we normalize the input data depending on the given gravitational direction in the model.
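The fixed-axis mirroring described above can be illustrated with a minimal sketch. It assumes the antisymmetry is obtained by storing one half of a grid kernel and deriving the other half as its negated point reflection, stacked along the y-axis. The 2D list layout and the helper names `build_antisymmetric_kernel` and `point_reflect` are illustrative assumptions, not the authors' TensorFlow implementation.

```python
# Sketch only: a 2D grid kernel stored as a list of rows.
# Assumption: the outer index is the y-axis, the inner index is the x-axis.


def build_antisymmetric_kernel(half):
    """Extend `half` along the y-axis with its negated point reflection."""
    mirrored = [[-value for value in reversed(row)] for row in reversed(half)]
    return half + mirrored


def point_reflect(kernel):
    """Reflect a kernel through its center by reversing both axes."""
    return [list(reversed(row)) for row in reversed(kernel)]


# Hypothetical free parameters: only this half would be learned.
half = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
full = build_antisymmetric_kernel(half)

# Antisymmetry check: the kernel equals the negative of its point reflection.
reflected = point_reflect(full)
is_antisymmetric = all(
    a == -b
    for row_a, row_b in zip(full, reflected)
    for a, b in zip(row_a, row_b)
)

# An antisymmetric kernel sums to zero by construction.
kernel_sum = sum(sum(row) for row in full)
```

With this construction only half of the weights are free parameters, and the choice of stacking axis (here y) does not affect the antisymmetry check, which is consistent with the statement that the axis can be chosen freely.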


A Proofs

Neural Information Processing Systems

Section A.1 presents the lemmas used to prove the main results. Section A.2 presents the main results [...] The first two inequalities follow from the triangle inequality, and the third inequality follows from the definition of the L-divergence, Eq. (5). We complete the proof by applying Lemma A.1 to bound [...] Following the conditions of Theorem 4.1, the upper bound of Var [...] Based on the conditions of Theorem 4.1, we assume [...] We complete the proof by applying Lemma A.3 and Lemma A.4 to bound the Rademacher [...] Following the proof of Theorem 4.1, we have |D [...] Following the conditions of Proposition 4.3, as N [...], we have [...] Based on the result of Proposition 4.3, for any δ ∈ (0, 1), we know that 4LB ( 2 D ln 2 + 1) [...] We complete the proof by applying the triangle inequality. III: Samples from p and q are labeled with 0 and 1, respectively. All values are averaged over five trials.



Supplementary Materials for "DropCov: A Simple yet Effective Method for Improving Deep Architectures" Qilong Wang

Neural Information Processing Systems

Our proposed DropCov can be flexibly integrated with existing deep architectures (e.g., CNNs [...] Qinghua Hu is the corresponding author and is with the Engineering Research Center of City intelligence and Digital Governance, Ministry of Education of the People's Republic of China. [...] VGG-VD on three small-scale fine-grained datasets) show that 0.5 is the best choice of [...] As listed in Table S2, a single L T module brings a small gain for plain GCP. Compared to B-CNN + L T (79.62% training accuracy), plain GCP [...] GCP + L T, while B-CNN + L T achieves a significant improvement over B-CNN and plain GCP. On the contrary, the samples involving less redundant information (e.g., scene) have large [...] These phenomena are consistent with our finding. Is second-order information helpful for large-scale visual recognition?




Algorithm 1: Pseudocode of PIC in a PyTorch-like style

Neural Information Processing Systems

Linear Evaluation Protocol. In linear evaluation, we follow the common setting [6, 5] to freeze the backbone of ResNet-50 and train a supervised linear classifier on the global average pooling features for 100 epochs. Note that the 2-layer head in unsupervised pre-training is not used in the linear evaluation stage. During training, we augment the image with random scaling from 0.5 to 2.0, crop size of 769 and random flip. The top-1 and top-5 accuracy results are reported in Table 9. From the perspective of optimization goals, the only difference between the parametric instance classification framework and the supervised classification framework is how the classes are defined for each instance.
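The last point, that the two frameworks differ only in how classes are defined, can be made concrete with a small sketch. It assumes a plain softmax cross-entropy objective for both cases. The toy labels and the helper `cross_entropy` are hypothetical and are not taken from the paper's Algorithm 1.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, stabilized by subtracting the max."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


# Hypothetical toy dataset: 4 images that belong to 2 semantic classes.
semantic_labels = [0, 0, 1, 1]

# Instance classification: every image is its own class, so the target of
# image i is simply its index i in the dataset.
instance_labels = list(range(len(semantic_labels)))

num_supervised_classes = len(set(semantic_labels))  # classifier width: 2
num_instance_classes = len(instance_labels)         # classifier width: 4

# The same loss is used in both cases; only the target and the number of
# output logits change. Uniform (all-zero) logits give log(num_classes).
uniform_supervised = cross_entropy([0.0] * num_supervised_classes, semantic_labels[2])
uniform_instance = cross_entropy([0.0] * num_instance_classes, instance_labels[2])
```

In this sketch the classifier width grows with the dataset size in the instance case, which is the practical consequence of treating each image as its own class.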


Addressing divergent representations from causal interventions on neural networks

arXiv.org Artificial Intelligence

A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two cases of such divergences: "harmless" divergences that occur in the behavioral null-space of the layer(s) of interest, and "pernicious" divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss from Grant (2025), allowing representations from causal interventions to remain closer to the natural distribution and reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions. Together, these results highlight a path towards more reliable interpretability methods.
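The notion of a "harmless" divergence in a behavioral null-space can be illustrated for the simplest possible case. The sketch below assumes a single linear readout `W`, which is an illustrative simplification and not the paper's setting. It shows only the harmless case: a shift that lies in the null-space of `W` changes the representation without changing the output. The pernicious case involving hidden pathways is not modeled here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


# Hypothetical linear readout that ignores the third hidden dimension.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

h_natural = [0.5, -1.0, 2.0]     # a representation from the natural distribution
delta_null = [0.0, 0.0, 3.0]     # lies in the null-space of W: W @ delta_null == 0
delta_active = [1.0, 0.0, 0.0]   # lies outside the null-space

h_harmless = add(h_natural, delta_null)
h_changed = add(h_natural, delta_active)

out_natural = matvec(W, h_natural)
out_harmless = matvec(W, h_harmless)  # representation diverged, output did not
out_changed = matvec(W, h_changed)    # representation and output both diverged
```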


FedAPA: Federated Learning with Adaptive Prototype Aggregation Toward Heterogeneous Wi-Fi CSI-based Crowd Counting

arXiv.org Artificial Intelligence

Wi-Fi channel state information (CSI)-based sensing provides a non-invasive, device-free approach for tasks such as human activity recognition and crowd counting, but large-scale deployment is hindered by the need for extensive site-specific training data. Federated learning (FL) offers a way to avoid raw data sharing but is challenged by heterogeneous sensing data and device resources. This paper proposes FedAPA, a collaborative Wi-Fi CSI-based sensing algorithm that uses an adaptive prototype aggregation (APA) strategy to assign similarity-based weights to peer prototypes, enabling adaptive client contributions and yielding a personalized global prototype for each client instead of a fixed-weight aggregation. During local training, we adopt a hybrid objective that combines classification learning with representation contrastive learning to align local and global knowledge. We provide a convergence analysis of FedAPA and evaluate it in a real-world distributed Wi-Fi crowd counting scenario with six environments and up to 20 people. The results show that our method outperforms multiple baselines in terms of accuracy, F1 score, mean absolute error (MAE), and communication overhead, with FedAPA achieving at least a 9.65% increase in accuracy, a 9% gain in F1 score, a 0.29 reduction in MAE, and a 95.94% reduction in communication overhead.
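The idea of similarity-based weighting of peer prototypes can be sketched as follows. The abstract does not state which similarity measure or normalization FedAPA uses, so cosine similarity and a softmax are assumptions made here for illustration, as are the function names and the toy vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def softmax(scores):
    """Normalize scores into positive weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def personalized_prototype(own, peers):
    """Aggregate peer prototypes with weights based on similarity to `own`."""
    weights = softmax([cosine(own, peer) for peer in peers])
    aggregated = [
        sum(w * peer[i] for w, peer in zip(weights, peers))
        for i in range(len(own))
    ]
    return aggregated, weights


# Hypothetical 2D prototypes for one class, one per peer client.
own = [1.0, 0.0]
peers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
proto, weights = personalized_prototype(own, peers)
```

Because the weights depend on `own`, each client obtains a different aggregate from the same set of peers, in contrast to a fixed-weight average that is identical for every client.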


Short-Range Oversquashing

arXiv.org Artificial Intelligence

Message Passing Neural Networks (MPNNs) are widely used for learning on graphs, but their ability to process long-range information is limited by the phenomenon of oversquashing. This limitation has led some researchers to advocate Graph Transformers as a better alternative, whereas others suggest that it can be mitigated within the MPNN framework, using virtual nodes or other rewiring techniques. In this work, we demonstrate that oversquashing is not limited to long-range tasks, but can also arise in short-range problems. This observation allows us to disentangle two distinct mechanisms underlying oversquashing: (1) the bottleneck phenomenon, which can arise even in low-range settings, and (2) the vanishing gradient phenomenon, which is closely associated with long-range tasks. We further show that the short-range bottleneck effect is not captured by existing explanations for oversquashing, and that adding virtual nodes does not resolve it. In contrast, transformers do succeed in such tasks, positioning them as the more compelling solution to oversquashing, compared to specialized MPNNs.